GFSF: A Novel Similarity Join Method Based on Frequency Vector
Abstract
String similarity join is widely used in many fields, e.g. data cleaning, web search, pattern recognition, and DNA sequence matching. In recent years, many similarity join methods have been proposed, for example Pass-Join, Ed-Join, and Trie-Join, among which the edit-distance-based Pass-Join algorithm achieves much better overall performance than the others. However, Pass-Join cannot effectively filter out candidate pairs that are only partially similar. Here a novel algorithm called GFSF is proposed, which introduces two additional filtering steps based on the character frequency vector. In this way, the number of candidate pairs that are only partially similar is greatly reduced, which in turn greatly reduces the total time of the string similarity join. The experimental results show that the overall performance of the proposed method is better than that of Pass-Join.
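The abstract does not spell out GFSF's two filtering steps, so the following is only a minimal sketch of the general idea of filtering by character frequency vectors, not the authors' algorithm. It relies on a standard bound: each edit operation changes the surplus of characters on either side by at most one, so the larger of the two surpluses is a lower bound on the edit distance. The function names (`frequency_lower_bound`, `edit_distance`, `similarity_join`) and the threshold parameter `tau` are hypothetical, and the nested-loop join stands in for the candidate generation that Pass-Join would normally provide.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on edit distance derived from character frequency vectors.

    Illustrative only: this is a generic frequency-distance bound, assumed
    here as a stand-in for the filters the abstract mentions.
    """
    fs, ft = Counter(s), Counter(t)
    surplus_s = sum((fs - ft).values())  # characters s has in excess of t
    surplus_t = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_join(strings: list[str], tau: int) -> list[tuple[str, str]]:
    """Naive self-join: prune with the frequency bound, then verify."""
    results = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if frequency_lower_bound(s, t) > tau:
                continue  # cannot be within tau edits, skip verification
            if edit_distance(s, t) <= tau:
                results.append((s, t))
    return results
```

The bound is cheap to evaluate compared with the quadratic verification step, which is why a filter of this kind can cut total join time. It is not exact: "abc" and "cba" have identical frequency vectors, so the bound is 0 even though their edit distance is 2, and such pairs still need verification.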
Similar resources
Power-Law Based Estimation of Set Similarity Join Size
We propose a novel technique for estimating the size of set similarity join. The proposed technique relies on a succinct representation of sets using Min-Hash signatures. We exploit frequent patterns in the signatures for the Set Similarity Join (SSJoin) size estimation by counting their support. However, there are overlaps among the counts of signature patterns and we need to use the set Inclu...
A novel cooperative game between client and subcontractors based on technical characteristics
Large projects often have several activities which are performed by some subcontractors with several skills. Costs and time reduction and quality improvement of the project are very important for client and subcontractors. Therefore, in real large projects, subcontractors join together and form coalitions for improving the project profit. A key question is how an extra profit of cooperation amo...
A novel hybrid method for vocal fold pathology diagnosis based on Russian language
In this paper, first, an initial feature vector for vocal fold pathology diagnosis is proposed. Then, for optimizing the initial feature vector, a genetic algorithm is proposed. Some experiments are carried out for evaluating and comparing the classification accuracies which are obtained by the use of the different classifiers (ensemble of decision tree, discriminant analysis and K-nearest neig...
A Novel Method for Tracking Moving Objects using Block-Based Similarity
Extracting and tracking active objects are two major issues in surveillance and monitoring applications such as nuclear reactors, mine security, and traffic controllers. In this paper, a block-based similarity algorithm is proposed in order to detect and track objects in the successive frames. We define similarity and cost functions based on the features of the blocks, leading to less computati...
A novel method for detecting structural damage based on data-driven and similarity-based techniques under environmental and operational changes
The applications of time series modeling and statistical similarity methods to structural health monitoring (SHM) provide promising and capable approaches to structural damage detection. The main aim of this article is to propose an efficient univariate similarity method named as Kullback similarity (KS) for identifying the location of damage and estimating the level of damage severity. An impr...
Publication date: 2016